Introductory cleaning
This report was written with the goal of analysing the HELOC dataset and preparing it for machine learning algorithms. The most important thing to clarify beforehand is that the original dataset was randomly split into three separate subsets (training, tuning and testing) in 3:1:1 proportions. As the original data had nearly 10 000 observations, a rough number of rows in each subset is easy to compute.
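The split itself happened before this report; a minimal sketch of how such a 3:1:1 random split could be produced (the function name and the seed are assumptions, not the original code):

```r
# Sketch of a 3:1:1 random split; seed and helper name are assumed.
split_3_1_1 <- function(df, seed = 42) {
  set.seed(seed)
  n <- nrow(df)
  # assign shuffled train/tune/test labels in 3:1:1 proportions
  grp <- sample(cut(seq_len(n), breaks = n * c(0, 3, 4, 5) / 5,
                    labels = c("train", "tune", "test")))
  split(df, grp)
}
```

Each part would then be written out with `readr::write_csv()` to its own file, such as the `df_train.csv` read below.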
We begin by reading the .csv file with the training set. This is the only one we will analyse in this report, as we should have no prior knowledge about the other two before using them for tuning and testing respectively.
library(readr)  # read_csv() and the col_*() specification helpers

df <- read_csv("dataset/df_train.csv", col_types = cols(
RiskPerformance = col_factor(levels = c("Good", "Bad")),
ExternalRiskEstimate = col_double(),
MSinceOldestTradeOpen = col_double(),
MSinceMostRecentTradeOpen = col_double(),
AverageMInFile = col_double(),
NumSatisfactoryTrades = col_double(),
NumTrades60Ever2DerogPubRec = col_double(),
NumTrades90Ever2DerogPubRec = col_double(),
PercentTradesNeverDelq = col_double(),
MSinceMostRecentDelq = col_double(),
MaxDelq2PublicRecLast12M = col_double(),
MaxDelqEver = col_double(),
NumTotalTrades = col_double(),
NumTradesOpeninLast12M = col_double(),
PercentInstallTrades = col_double(),
MSinceMostRecentInqexcl7days = col_double(),
NumInqLast6M = col_double(),
NumInqLast6Mexcl7days = col_double(),
NetFractionRevolvingBurden = col_double(),
NetFractionInstallBurden = col_double(),
NumRevolvingTradesWBalance = col_double(),
NumInstallTradesWBalance = col_double(),
NumBank2NatlTradesWHighUtilization = col_double(),
PercentTradesWBalance = col_double()
))
General look at variables
We had some prior knowledge of the data, so we knew that – other than the target column RiskPerformance – every variable is numeric. As such, we can look at a pair plot, in which every combination of two variables is displayed.
At first glance one can easily notice that this plot is not very readable. The folder containing this report, however, includes pairs_plot.png in the original resolution, which can be enlarged and actually read.
Missing data
We didn’t use the obvious choice of checking missing-data statistics with the built-in functions of the DataExplorer library, because we knew that NAs are encoded differently in this dataset. To clarify: all non-missing data here is expressed with non-negative numbers, while there are three kinds of NAs, each with its own negative code (-7, -8 or -9) and its own meaning. They also appear at different frequencies throughout the variables, though each code sums up to a similar total.
It may be valuable to take a look at the distribution of NAs across variables, so here is a plot illustrating the issue.
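The counts behind such a plot can be computed directly; a sketch (the helper name is an assumption, not the report's original code):

```r
# Fraction of each special NA code in every column of a data frame.
na_code_fractions <- function(data, codes = c(-7, -8, -9)) {
  out <- sapply(data, function(col)
    sapply(codes, function(code) mean(col == code)))
  rownames(out) <- paste0("frac_", codes)
  t(out)  # rows = variables, columns = NA codes
}
```

Applied to the training data as, e.g., `round(na_code_fractions(df[, -1]), 3)`, it yields one row per variable with the share of each code.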
Note that the MSinceMostRecentDelq column shows the highest fraction of this complicated NA structure.
Now that we have seen a graphical representation, let’s move on to the description, which will tell us a bit more about the meaning of each NA code. As we said before, -9 means no credit history and/or score available and can be considered the “true” NA, as it is a typical example of missing data. -8 is used when there is no usable data for a record, e.g. if the person was inactive for a year. Finally, -7 is used when a condition is not met, as when someone has never had an inquiry before the last 7 days (second row, last column in the plot above) and we ask for the months since such an event.
Some rows contain only -9s (not counting the target column), so they carry no information. Let’s drop these rows.
df <- df[!apply(df[, -1], 1, function(x) all(x == -9)), ]
Now we need to check which columns still contain -9s.
names(which(apply(df[, -1], 2, function(x) any(x == -9))))
## [1] "ExternalRiskEstimate"
Luckily, only one column contains -9s after our first cleaning step. These are records for which nobody ever made a risk estimate. They are not numerous (only 5 cases, less than 0.1% of the data), so we can assume these estimates would lie somewhere around the median. The code below substitutes the median value of this column for the aforementioned cases.
# impute the median, computed without the -9 codes themselves
df[df$ExternalRiskEstimate == -9, "ExternalRiskEstimate"] <-
  median(df$ExternalRiskEstimate[df$ExternalRiskEstimate != -9])
Now the plot looks similar to the previous one, but with -9s omitted, as none are left in the dataset. The non-zero percentages grew a bit because we deleted some rows, but the overall proportions remained the same as before.
Overlapping columns
Remember that gigantic plot of every possible column pair? There was more on it than just scatter plots: it also contained the calculated correlation between each pair, which we can now use to decide which columns to drop. But first, let’s plot it again after the cleaning, as the results might have changed a bit.
knitr::include_graphics("good_plots/zoom.gif")


If we look closely and thoroughly, we can find two pairs of columns with high correlation values. Their names are also an indicator of correlation. After several rounds of rock-paper-scissors a decision was made to:

- exclude NumTrades60Ever2DerogPubRec and leave NumTrades90Ever2DerogPubRec (0.897 correlation),
- exclude NumInqLast6M and leave NumInqLast6Mexcl7days (0.992 correlation).
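Reading correlations off the plot works, but such pairs can also be found programmatically; a sketch (the helper name and the 0.85 threshold are assumptions):

```r
# List every pair of columns whose absolute correlation exceeds a threshold.
high_cor_pairs <- function(data, threshold = 0.85) {
  cor_mat <- cor(data, use = "pairwise.complete.obs")
  cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA  # keep each pair only once
  idx <- which(abs(cor_mat) > threshold, arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, 1]],
             var2 = colnames(cor_mat)[idx[, 2]],
             cor  = cor_mat[idx])
}
```

Calling it as `high_cor_pairs(df[, sapply(df, is.numeric)])` would surface the two pairs named above.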
df <- drop_columns(df, c("NumTrades60Ever2DerogPubRec", "NumInqLast6M"))
Back to other NAs
We got rid of -9s in our dataset, but there are still two other NA codes. Before treating them, we need to remind ourselves which columns contain them.
(na_columns <- names(which(apply(df[, -1], 2, function(x) any(x %in% c(-7, -8))))))
## [1] "MSinceOldestTradeOpen"
## [2] "MSinceMostRecentDelq"
## [3] "MSinceMostRecentInqexcl7days"
## [4] "NetFractionRevolvingBurden"
## [5] "NetFractionInstallBurden"
## [6] "NumRevolvingTradesWBalance"
## [7] "NumInstallTradesWBalance"
## [8] "NumBank2NatlTradesWHighUtilization"
## [9] "PercentTradesWBalance"
What seemed important to us was the impact of these 9 variables with NAs on the target variable, so we used boxplots to get some idea.
Based on the plots above and the correlation plot between all variables (from a moment before), we decided to drop the NumInstallTradesWBalance column, which had about 10% of -8s and almost identical boxes regardless of the chosen RiskPerformance category.
df <- drop_columns(df, "NumInstallTradesWBalance")
na_columns <- na_columns[na_columns != "NumInstallTradesWBalance"]
Conversion of a column with multiple NA types into two columns
The encoding of missing data used in our dataset is, honestly speaking, quite inconvenient. We decided to remedy that by splitting the necessary columns into two separate ones:
- the old numerical column, with one general NA code,
- a new logical column depending on the type of NA.

The second column is created with the following question in mind: “Does the considered entity exist?”. As such, we encoded it with FALSE for -7s, as -7 indicates that the requested entity cannot exist, and TRUE for -8s and for non-missing data. We would also set NA in this column for -9s, but none were left by this time. Quite obviously, we dropped this column whenever it contained only TRUE values, as it would then tell us nothing.
This type of conversion was applied to the two columns with mixed NAs – MSinceMostRecentInqexcl7days and MSinceMostRecentDelq – and to six more with -8s only. The table below shows a sample of the two mixed columns before the conversion.
| MSinceMostRecentDelq | MSinceMostRecentInqexcl7days |
|---------------------:|-----------------------------:|
| 22 | -7 |
| -7 | 0 |
| 0 | 2 |
| 32 | 0 |
| -7 | 0 |
| 3 | 4 |
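The na_process helper used below is not shown in the report; a sketch consistent with the description above (this exact signature and behaviour are assumptions):

```r
# Split one column with coded NAs into a numeric column (one general NA)
# and, optionally, a logical "does the entity exist?" column.
na_process <- function(df, col, new_col = NULL) {
  x <- df[[col]]
  # the logical column is requested only for variables mixing -7 with -8
  if (!is.null(new_col)) df[[new_col]] <- x != -7  # FALSE only for -7
  df[[col]] <- replace(x, x %in% c(-7, -8), NA)    # both codes become NA
  df
}
```

For the six columns containing only -8s, the call without a new column name simply replaces the code with NA.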
df <- na_process(df, "MSinceMostRecentDelq", "HadDelq")
df <- na_process(df, "MSinceMostRecentInqexcl7days", "HadInqexcl7days")
df <- na_process(df, "MSinceOldestTradeOpen")
df <- na_process(df, "NetFractionRevolvingBurden")
df <- na_process(df, "NetFractionInstallBurden")
df <- na_process(df, "NumRevolvingTradesWBalance")
df <- na_process(df, "NumBank2NatlTradesWHighUtilization")
df <- na_process(df, "PercentTradesWBalance")
| MSinceMostRecentDelq | HadDelq | MSinceMostRecentInqexcl7days | HadInqexcl7days |
|---------------------:|:--------|-----------------------------:|:----------------|
| 22 | TRUE | NA | FALSE |
| NA | FALSE | 0 | TRUE |
| 0 | TRUE | 2 | TRUE |
| 32 | TRUE | 0 | TRUE |
| NA | FALSE | 0 | TRUE |
| 3 | TRUE | 4 | TRUE |
Coefficient of determination
As a final step, we computed the vtreat R² statistic for all variables with respect to the target.
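The report's exact vtreat invocation is not shown; a hedged sketch of how per-variable R² can be obtained from a vtreat treatment plan (the helper name is an assumption):

```r
# Score each variable against a binary target via vtreat's scoreFrame,
# which reports a pseudo R-squared (rsq) and significance (sig) per variable.
rsq_table <- function(data, target = "RiskPerformance", positive = "Bad") {
  plan <- vtreat::designTreatmentsC(data, setdiff(names(data), target),
                                    outcomename   = target,
                                    outcometarget = positive,
                                    verbose       = FALSE)
  plan$scoreFrame[, c("varName", "rsq", "sig")]
}
```

Calling `rsq_table(df)` on the cleaned training set would rank the remaining variables by their individual predictive strength.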